AITopics

2511.0708

Country: Asia > Middle East (0.28)

Genre: Research Report (0.84)

Industry: Education (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.66)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.56)

arXiv.org Artificial Intelligence Aug-1-2025

DrugMCTS: a drug repurposing framework combining multi-agent, RAG and Monte Carlo Tree Search

Yang, Zerui, Wan, Yuwei, Yan, Siyu, Matsuda, Yudai, Xie, Tong, Hoex, Bram, Song, Linqi

Recent advances in large language models have demonstrated considerable potential in scientific domains such as drug repositioning. However, their effectiveness remains constrained when reasoning extends beyond the knowledge acquired during pre-training. Conventional approaches, such as fine-tuning or retrieval-augmented generation, either impose high computational overhead or fail to fully exploit structured scientific data. To overcome these challenges, we propose DrugMCTS, a novel framework that synergistically integrates RAG, multi-agent collaboration, and Monte Carlo Tree Search for drug repositioning. The framework employs five specialized agents tasked with retrieving and analyzing molecular and protein information, thereby enabling structured and iterative reasoning. Extensive experiments on the DrugBank and KIBA datasets demonstrate that DrugMCTS achieves substantially higher recall and robustness compared to both general-purpose LLMs and deep learning baselines. Our results highlight the importance of structured reasoning, agent-based collaboration, and feedback-driven search mechanisms in advancing LLM applications for drug repositioning.

large language model, machine learning, natural language, (18 more...)

2507.07426

Country:

Asia > China (0.47)
Oceania > Australia > New South Wales (0.28)

Genre: Research Report > New Finding (0.66)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial Intelligence Dec-12-2024

TouchTTS: An Embarrassingly Simple TTS Framework that Everyone Can Touch

Song, Xingchen, Xing, Mengtao, Ma, Changwei, Li, Shengqiang, Wu, Di, Zhang, Binbin, Pan, Fuping, Zhou, Dinghao, Zhang, Yuekai, Lei, Shun, Peng, Zhendong, Wu, Zhiyong

It is well known that LLM-based systems are data-hungry. Recent LLM-based TTS works typically employ complex data processing pipelines to obtain high-quality training data. These sophisticated pipelines require excellent models at each stage (e.g., speech denoising, speech enhancement, speaker diarization, and punctuation models), which themselves demand high-quality training data and are rarely open-sourced. Even with state-of-the-art models, issues persist, such as incomplete background noise removal and misalignment between punctuation and actual speech pauses. Moreover, the stringent filtering strategies often retain only 10-30% of the original data, significantly impeding data scaling efforts. In this work, we leverage a noise-robust audio tokenizer (S3Tokenizer) to design a simplified yet effective TTS data processing pipeline that maintains data quality while substantially reducing data acquisition costs, achieving a data retention rate of over 50%. Beyond data scaling challenges, LLM-based TTS systems also incur higher deployment costs compared to conventional approaches. Current systems typically use LLMs solely for text-to-token generation, while requiring separate models (e.g., flow matching models) for token-to-waveform generation, which cannot be directly executed by LLM inference engines, further complicating deployment. To address these challenges, we eliminate redundant modules in both LLM and flow components, replacing the flow model backbone with an LLM architecture. Building upon this simplified flow backbone, we propose a unified architecture for both streaming and non-streaming inference, significantly reducing deployment costs. Finally, we explore the feasibility of unifying TTS and ASR tasks using the same data for training, thanks to the simplified pipeline and the S3Tokenizer that reduces the quality requirements for TTS training data.

arxiv preprint arxiv, large language model, machine learning, (16 more...)

2412.08237

Country: Oceania > New Zealand (0.04)

Genre: Research Report (1.00)

Industry: Information Technology (0.70)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

arXiv.org Artificial Intelligence Oct-10-2024

Koala-36M: A Large-scale Video Dataset Improving Consistency between Fine-grained Conditions and Video Content

Wang, Qiuheng, Shi, Yukai, Ou, Jiarong, Chen, Rui, Lin, Ke, Wang, Jiahao, Jiang, Boyuan, Yang, Haotian, Zheng, Mingwu, Tao, Xin, Yang, Fei, Wan, Pengfei, Zhang, Di

As visual generation technologies continue to advance, the scale of video datasets has expanded rapidly, and the quality of these datasets is critical to the performance of video generation models. We argue that temporal splitting, detailed captions, and video quality filtering are three key factors that determine dataset quality. However, existing datasets exhibit various limitations in these areas. To address these challenges, we introduce Koala-36M, a large-scale, high-quality video dataset featuring accurate temporal splitting, detailed captions, and superior video quality. The core of our approach lies in improving the consistency between fine-grained conditions and video content. Specifically, we employ a linear classifier on probability distributions to enhance the accuracy of transition detection, ensuring better temporal consistency. We then provide structured captions for the split videos, with an average length of 200 words, to improve text-video alignment. Additionally, we develop a Video Training Suitability Score (VTSS) that integrates multiple sub-metrics, allowing us to filter high-quality videos from the original corpus. Finally, we incorporate several metrics into the training process of the generation model, further refining the fine-grained conditions. Our experiments demonstrate the effectiveness of our data processing pipeline and the quality of the proposed Koala-36M dataset. Our dataset and code will be released at https://koala36m.github.io/.

large language model, machine learning, natural language, (16 more...)

2410.0826

Country:

Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Asia > China > Guangdong Province > Shenzhen (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

arXiv.org Artificial Intelligence Mar-17-2024

WanJuan-CC: A Safe and High-Quality Open-sourced English Webtext Dataset

Qiu, Jiantao, Lv, Haijun, Jin, Zhenjiang, Wang, Rui, Ning, Wenchang, Yu, Jia, Zhang, ChaoBin, Li, Zhenxiang, Chu, Pei, Qu, Yuan, Shi, Jin, Lu, Lindong, Peng, Runyu, Zeng, Zhiyuan, Tang, Huanze, Lei, Zhikai, Hong, Jiawei, Chen, Keyu, Fei, Zhaoye, Xu, Ruiliang, Li, Wei, Tu, Zhongying, Lin, Dahua, Qiao, Yu, Yan, Hang, He, Conghui

This paper presents WanJuan-CC, a safe and high-quality open-sourced English webtext dataset derived from Common Crawl data. The study addresses the challenges of constructing large-scale pre-training datasets for language models, which require vast amounts of high-quality data. A comprehensive process was designed to handle Common Crawl data, including extraction, heuristic rule filtering, fuzzy deduplication, content safety filtering, and data quality filtering. From approximately 68 billion original English documents, we obtained 2.22T tokens of safe data and selected 1.0T tokens of high-quality data as part of WanJuan-CC. We have open-sourced 100B tokens from this dataset. The paper also provides statistical information related to data quality, enabling users to select appropriate data according to their needs. To evaluate the quality and utility of the dataset, we trained 1B-parameter and 3B-parameter models using WanJuan-CC and another dataset, RefinedWeb. Results show that WanJuan-CC performs better on validation datasets and downstream tasks.

data quality, dataset, language model, (15 more...)

2402.19282

Country:

Asia > Middle East > Jordan (0.04)
Asia > China > Shanghai > Shanghai (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
(3 more...)

Genre: Research Report > New Finding (1.00)

Industry: Information Technology > Security & Privacy (0.68)

Technology:

Information Technology > Data Science > Data Quality (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
(2 more...)

arXiv.org Artificial Intelligence Dec-22-2023

YAYI 2: Multilingual Open-Source Large Language Models

Luo, Yin, Kong, Qingchao, Xu, Nan, Cao, Jia, Hao, Bao, Qu, Baoyu, Chen, Bo, Zhu, Chao, Zhao, Chenyang, Zhang, Donglei, Feng, Fan, Zhao, Feifei, Sun, Hailong, Yang, Hanxuan, Pan, Haojun, Liu, Hongyu, Guo, Jianbin, Du, Jiangtao, Wang, Jingyi, Li, Junfeng, Sun, Lei, Liu, Liduo, Dong, Lifeng, Liu, Lili, Wang, Lin, Zhang, Liwen, Wang, Minzheng, Wang, Pin, Yu, Ping, Li, Qingxiao, Yan, Rui, Zou, Rui, Li, Ruiqun, Huang, Taiwen, Wang, Xiaodong, Wu, Xiaofei, Peng, Xin, Zhang, Xina, Fang, Xing, Xiao, Xinglin, Hao, Yanni, Dong, Yao, Wang, Yigang, Liu, Ying, Jiang, Yongyu, Wang, Yungan, Wang, Yuqi, Wang, Zhangsheng, Yu, Zhaoxin, Luo, Zhen, Mao, Wenji, Wang, Lei, Zeng, Dajun

As the latest advancement in natural language processing, large language models (LLMs) have achieved human-level language understanding and generation abilities in many real-world tasks, and have even been regarded as a potential path to artificial general intelligence. To better facilitate research on LLMs, many open-source LLMs, such as Llama 2 and Falcon, have recently been proposed and have achieved performance comparable to proprietary models. However, these models are primarily designed for English scenarios and perform poorly in Chinese contexts. In this technical report, we propose YAYI 2, including both base and chat models, with 30 billion parameters. YAYI 2 is pre-trained from scratch on a multilingual corpus which contains 2.65 trillion tokens filtered by our pre-training data processing pipeline. The base model is aligned with human values through supervised fine-tuning with millions of instructions and reinforcement learning from human feedback. Extensive experiments on multiple benchmarks, such as MMLU and CMMLU, consistently demonstrate that the proposed YAYI 2 outperforms other similarly sized open-source models.

benchmark, information, yayi 2, (16 more...)

2312.14862

Country:

North America > United States > California > San Diego County > San Diego (0.04)
Asia > Japan > Kyūshū & Okinawa > Kyūshū (0.04)
Asia > China > Beijing > Beijing (0.04)

Genre: Research Report (0.82)

Industry:

Information Technology > Security & Privacy (1.00)
Law (0.93)
Education (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

#artificialintelligence May-25-2022, 21:23:08 GMT

Event-Driven Scalability in Data Processing Pipeline

Building a data processing pipeline is one of the most common engineering problems; depending on the volume and frequency of the data, you may have solved it with small scripts or with a full-fledged scalable system. In this article, we will talk about event-driven scalability, a backbone that is cost-optimized and requires minimal development and operations effort. Why build an event-driven and scalable data processing pipeline? When working with a startup, or building a team or personal project that requires a data processing pipeline, cost is always a constraint. We will use a simple example: building a metadata extraction system for e-commerce products.

architecture, data processing pipeline, requirement, (10 more...)

Industry: Information Technology > Software (1.00)

Technology: Information Technology > Artificial Intelligence (0.36)

#artificialintelligence May-20-2022, 16:35:48 GMT

Operationalizing Machine Learning from PoC to Production - KDnuggets

Many companies use machine learning to help create a differentiator and grow their business. However, it's not easy to make machine learning work as it requires a balance between research and engineering. One can come up with a good innovative solution based on current research, but it might not go live due to engineering inefficiencies, cost and complexity. Most companies haven't seen much ROI from machine learning since the benefit is realized only when the models are in production. Let's dive into the challenges and best practices that one can follow to make machine learning work.

accuracy, complexity, operationalizing machine learning, (12 more...)

Country: Asia > India > Karnataka > Bengaluru (0.05)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.31)

#artificialintelligence Nov-6-2018, 07:15:05 GMT

NVIDIA Brings The Power Of GPU To Data Processing Pipelines

NVIDIA has launched an open source project called Real-time Acceleration Platform for Integrated Data Science (RAPIDS) that aims to deliver end-to-end data science infrastructure based on GPUs. GPU-backed machines play an essential role in generating machine learning models. Data scientists run computationally intensive training jobs on GPUs. Massive datasets are converted into complex matrices of numbers that serve as input for machine learning and deep learning models. During the training process, these matrices are added to, multiplied by, and subtracted from other complex matrices.

artificial intelligence, deep learning, machine learning, (14 more...)

Industry: Information Technology > Hardware (0.68)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

#artificialintelligence Sep-23-2017, 05:55:06 GMT

Data Quality in the era of A.I. – Towards Data Science – Medium

As the director of datamine decision support systems, I've delivered more than 80 data-intensive projects -- including data warehousing, data integration, business intelligence, content performance and predictive models -- across several industries and high-profile corporations. In most cases, data quality proved to be a critical success factor. The obvious challenge in every case was to effectively query heterogeneous data sources, then extract and transform data towards one or more data models. The non-obvious challenge was the early identification of data issues, which in most cases were unknown to the data owners as well. There are many aspects to data quality, including consistency, integrity, accuracy, and completeness.

data mining, data quality, quality reference store, (16 more...)

Technology:

Information Technology > Data Science > Data Quality (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Information Fusion (0.54)